In this document, I describe a possible approach to quantifying the (probabilistic) calibration of forecasts of an ordinal variable. Our running example is state-level forecasts of the change in rates of influenza hospitalizations per 100k population that are being collected by the FluSight forecast hub in the 2023/24 season. Forecasts assign probabilities to the following categories describing the magnitude and direction of change in influenza: “large decrease”, “decrease”, “stable”, “increase”, “large increase”. Noting that these categories are ordered, we assign them the numeric labels 1, 2, 3, 4, and 5.
A predictive pmf from a particular forecaster for a “forecasting task” indexed by \(i\) is denoted by \(f_i\), and it assigns probabilities to these categories that are non-negative and sum to 1: \(f_i(k) \geq 0\), \(\sum_{k = 1}^5 f_i(k) = 1\). For our purposes, a forecasting task corresponds to a combination of location, reference date, and forecast horizon. For each such task, we will eventually observe the outcome \(y_i \in \{1, \ldots, 5\}\), which is the label of the hospitalization rate change class that occurred. For example, \(f_i(y_i)\) is the probability the foreaster assigned to the observed category. The predictive pmf \(f_i\) can also be converted into a predictive cdf \(F_i\) via \(F_i(k) = \sum_{j \leq k} f_i(k)\). \(F_i(y_i)\) is then the predicted probability that the rate change category would be less than or equal to the observed category. We define \(F_i(0) = 0\).
The proposed approach to measuring the calibration of these forecasts has two steps:
We’ll briefly recap these steps below, but also direct the reader to the following references:
For a continuous random variable \(Y\) with predictive cdf \(F\), the PIT value is \(Z = F(Y)\). In the setup of Ranjan and Gneiting, both \(Y\) and \(F\) are regarded as being random variables. When both are “realized”, we can compute an “observed PIT value” \(z = F(y)\) (where now we have a specific forecast distribution \(F\), but we don’t have good notation to distinguish it from the random forecast above).
Note that \(F(y)\) is the predicted probability that \(Y \leq y\), so it falls in the interval \([0, 1]\). The forecast \(F\) is probabilistically calibrated if its PIT \(Z\) is uniformly distributed on this interval: \(Z \sim Unif(0, 1)\).
In a setting where the random variable \(Y\) is not continuous, an adjustment to the computation of the PIT value is required in order to retain the result that \(Z \sim Unif(0, 1)\) for a probabilistically calibrated forecast. Here is a statement of Def. 2.6. in Ranjan and Gneiting that has been adapted for our (simpler, less general) setting: \[\begin{align*} Z &= F(Y - 1) + V \cdot [F(Y) - F(Y - 1)] \text{ where} \\ V &\sim Unif(0, 1) \end{align*}\] In this definition, the PIT score is distributed uniformly at random between the predicted cumulative probability at the observed category and the next-lowest category.
As a concrete example, suppose the forecast has the cumulative probabilities \(F(1) = 0.1\), \(F(2) = 0.2\), \(F(3) = 0.5\), \(F(4) = 0.8\), \(F(5) = 1\), and we observe \(y = 4\) (“increase”). The PIT value is selected uniformly at random on the interval \([0.5, 0.8]\).
The use of random numbers in the computation of the PIT values introduces some (extra) random noise into measures of forecast calibration. This should have limited impact in large sample sizes, but could be undesirable. If multiple runs of the procedure produce substantively different results, indicating sensitivity to this random number generation, multiple Monte Carlo realizations of PIT values could be generated for each forecast to reduce the impact of this randomness on scores. In the below, I generate 100 PIT values for each forecast.
As discussed above, if a forecaster’s predictions are probabilistically calibrated, their PIT values follow a uniform distribution on the interval from 0 to 1. One way to examine calibration of a forecaster is to examine a histogram of their PIT values across many predictions.
If a single numeric summary of calibration is desired, one option is to compute a measure of divergence of the distribution of PIT values from a \(Unif(0, 1)\) distribution. For this purpose, Rumack et al. propose the use of KL divergence, or equivalently the negative entropy of the PIT distribution. I find it easier to think about divergence than entropy, so I’m using that here. A low divergence indicates good probabilistic calibration.
Following the procedure in Rumack et al., we’ll estimate this entropy based on a histogram representation of the distribution of PIT values with 100 equal-sized bins on the interval from 0 to 1. The number of histogram bins used to compute the KL divergence is a tuning parameter of the measure of calibration, and there is no specific jsutification for the choice of 100 bins. Below I illustrate that rankings of forecast calibration by model are somewhat sensitive to this choice; e.g., one model is ranked 10/25 when 100 bins are used, but 3/25 when 20 bins are used. This is unfortunate…
library(hubUtils)
library(tidyverse)
library(DT)
library(plotly)
set.seed(42)
hub_path <- "../../../epi/flu/FluSight-forecast-hub"
hub_con <- connect_hub(hub_path)
forecasts <- hub_con |>
dplyr::filter(
output_type == "pmf"
) |>
dplyr::collect() |>
as_model_out_tbl() |>
dplyr::filter(horizon >= 0,
reference_date >= "2023-10-14",
location != "US",
location != "78")
Observed category levels – code borrowed from a draft report by Sarabeth Mathis.
# load location data
location_data <- readr::read_csv(file = "https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/main/auxiliary-data/locations.csv") %>%
dplyr::select(location,location_name, count_rate1, count_rate2, count_rate2p5, count_rate3, count_rate4, count_rate5)
# load target and merge with location data
# filter most recent target
# set rate, diff, criteria
weekly_data_all <- readr::read_csv(file = "https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/main/target-data/target-hospital-admissions.csv") %>%
filter(date >= as.Date("2023-10-01")) %>%
select(-c(`...1`, location))%>%
dplyr::inner_join(location_data,
by = c("location_name"))
weekly_rate_differences <- weekly_data_all %>% group_by(location_name) %>% arrange(date) %>%
mutate(rate_diff0 = weekly_rate - lag(weekly_rate, 1),
rate_diff1 = weekly_rate - lag(weekly_rate, 2),
rate_diff2 = weekly_rate - lag(weekly_rate, 3),
rate_diff3 = weekly_rate - lag(weekly_rate, 4)) %>%
ungroup() %>%
pivot_longer(cols = c(rate_diff0, rate_diff1, rate_diff2, rate_diff3), names_to = "horizon", names_prefix = "rate_diff", values_to = "rate_diff", names_transform = list(horizon = as.integer)) %>%
mutate(category = case_when(value < 10 | horizon == 0 & rate_diff < 1 ~ "stable",
horizon == 0 & rate_diff > 2 ~ "large_increase",
horizon == 0 & rate_diff < -2 ~ "large_decrease",
horizon == 0 & rate_diff >= 1 ~ "increase",
horizon == 0 & rate_diff <= -1 ~ "decrease",
value < 10 | horizon == 1 & rate_diff < 1 ~ "stable",
horizon == 1 & rate_diff > 3 ~ "large_increase",
horizon == 1 & rate_diff < -3 ~ "large_decrease",
horizon == 1 & rate_diff >= 1 ~ "increase",
horizon == 1 & rate_diff <= -1 ~ "decrease",
value < 10 | horizon == 2 & rate_diff < 2 ~ "stable",
horizon == 2 & rate_diff > 4 ~ "large_increase",
horizon == 2 & rate_diff < -4 ~ "large_decrease",
horizon == 2 & rate_diff >= 2 ~ "increase",
horizon == 2 & rate_diff <= -2 ~ "decrease",
value < 10 | horizon == 3 & rate_diff < 2.5 ~ "stable",
horizon == 3 & rate_diff > 5 ~ "large_increase",
horizon == 3 & rate_diff < -5 ~ "large_decrease",
horizon == 3 & rate_diff >= 2.5 ~ "increase",
horizon == 3 & rate_diff <= -2.5 ~ "decrease")) %>% select(date, location_name, location, horizon, rate_diff, category)
forecasts <- forecasts |>
mutate(
output_type_id = factor(
output_type_id,
levels = c("large_decrease", "decrease", "stable", "increase", "large_increase"),
ordered = TRUE),
cat_lvl = as.integer(output_type_id)
) |>
group_by(model_id, location, reference_date, horizon, target_end_date) |>
arrange(cat_lvl) |>
mutate(cdf = cumsum(value)) |>
mutate(cdf_lower_cat = dplyr::lag(cdf, 1, default = 0))
Here’s an example showing the results of the calculations we just did for one forecast:
forecasts |>
ungroup() |>
filter(model_id == "FluSight-ensemble", reference_date == "2024-01-13",
horizon == 3, location == "45") |>
select(output_type_id, cat_lvl, value, cdf, cdf_lower_cat)
## # A tibble: 5 × 5
## output_type_id cat_lvl value cdf cdf_lower_cat
## <ord> <int> <dbl> <dbl> <dbl>
## 1 large_decrease 1 0.360 0.360 0
## 2 decrease 2 0.130 0.489 0.360
## 3 stable 3 0.196 0.686 0.489
## 4 increase 4 0.0752 0.761 0.686
## 5 large_increase 5 0.239 1 0.761
We merge observed target categories and forecasts (dropping forecasts for unobserved categories) and compute the PIT values for each forecast. Note that we generate 100 PIT values for each forecast in an attempt to limit Monte Carlo variability of the results.
pit_values <- inner_join(
forecasts,
weekly_rate_differences,
by = join_by(location == location, reference_date == date, horizon == horizon,
output_type_id == category)) |>
mutate(
pit = list(runif(n = 100, min = cdf_lower_cat, max = cdf))
)
Here’s the result for our example:
selected_forecast <- pit_values |>
filter(model_id == "FluSight-ensemble", reference_date == "2024-01-13",
horizon == 3, location == "45") |>
select(output_type_id, cat_lvl, value, cdf, cdf_lower_cat, pit)
print(selected_forecast)
## # A tibble: 1 × 11
## # Groups: model_id, location, reference_date, horizon, target_end_date [1]
## model_id location reference_date horizon target_end_date output_type_id cat_lvl value cdf cdf_lower_cat pit
## <chr> <chr> <date> <int> <date> <chr> <int> <dbl> <dbl> <dbl> <list>
## 1 FluSight-ensemble 45 2024-01-13 3 2024-02-03 stable 3 0.196 0.686 0.489 <dbl [100]>
print(selected_forecast$pit)
## [[1]]
## [1] 0.6722263 0.6767345 0.5577674 0.5806787 0.6461267 0.6798668 0.6568912 0.6119024 0.6342172 0.5307141 0.6758339 0.6087311 0.5686434 0.5560664
## [15] 0.5129542 0.5873288 0.6558933 0.6196645 0.6259718 0.6654867 0.5356970 0.6683391 0.5187161 0.6375220 0.5448716 0.6309091 0.4957413 0.6143863
## [29] 0.6355299 0.4938238 0.5815303 0.4910886 0.6577531 0.5958244 0.5914874 0.6207717 0.6565250 0.5997045 0.4932691 0.5345279 0.4953668 0.5998668
## [43] 0.6650037 0.5353887 0.6696710 0.6745493 0.6474108 0.6551168 0.6045148 0.5822702 0.5492120 0.6490893 0.6349627 0.5733546 0.5671044 0.6825130
## [57] 0.5972106 0.5935265 0.6268197 0.6293368 0.6064260 0.5973937 0.5617396 0.5937489 0.6099248 0.5134852 0.5092668 0.6673691 0.6408617 0.5782751
## [71] 0.5747717 0.5342584 0.6578202 0.5697919 0.6708846 0.5119862 0.5581330 0.5279459 0.6672034 0.5650749 0.6148494 0.5393657 0.6028219 0.6831407
## [85] 0.6218009 0.5072142 0.5558145 0.6498640 0.5952550 0.6195969 0.5470093 0.6699017 0.4976268 0.4924220 0.5869367 0.6712876 0.5620945 0.6328929
## [99] 0.4929889 0.5218764
One approach to summarizing the pit values: histograms for each forecaster:
ggplot(data = unnest(pit_values, cols = "pit")) +
geom_histogram(mapping = aes(x = pit, y = after_stat(density)), boundary = 0) +
facet_wrap( ~ model_id)
A few example interpretations:
We can also compute a numeric summary of how far each model’s distribution of PIT scores is from uniform, here using KL divergence based on 100 bins. A lower divergence from the uniform distribution indicates better probabilistic calibration.
get_pit_kl_div_by_model <- function(pit_values, num_bins) {
pit_kl_div_by_model <- pit_values |>
unnest(cols = "pit") |>
group_by(model_id) |>
summarize(
pit_kl_div = KL.empirical(
hist(pit, breaks = seq(from = 0.0, to = 1.0, by = 1 / num_bins), plot = FALSE)$counts,
length(pit) * rep(1 / num_bins, num_bins)
)
) |>
arrange(pit_kl_div)
return(pit_kl_div_by_model)
}
datatable(get_pit_kl_div_by_model(pit_values, num_bins = 100))
The results look a little different when using 20 bins instead of 100:
datatable(get_pit_kl_div_by_model(pit_values, num_bins = 20))
Here’s a look at how ranks change as a function of the number of bins used (lower ranks, toward the bottom of the plot, indicate better calibration):
kl_div_by_model <- purrr::map(
seq(from = 10, to = 200, by = 10),
function(num_bins) {
get_pit_kl_div_by_model(pit_values, num_bins = num_bins) |>
mutate(
rank = rank(pit_kl_div),
num_bins = num_bins
)
}) |>
purrr::list_rbind()
p <- ggplot(data = kl_div_by_model) +
geom_line(mapping = aes(x = num_bins, y = rank, color = model_id))
ggplotly(p)
Maybe a reasonable option here is to just choose a fairly large number of bins? There aren’t that many rank switches after 100 bins…
As another alternative, it might be possible to skip the sampling of PITs and the binning here, and get to an exact computation of the measure of divergence we’re after. For a single forecast, the PIT value follows a uniform distribution: \(Z_i \sim Unif(l_i, u_i)\) as described above. Then collecting across all forecasts in our evaluation set, if I draw a single PIT score, that score comes from a mixture of uniform distributions. It may be possible to arrive at an exact result for the divergence of that mixture of uniforms from a \(Unif(0, 1)\) distribution.